computation of GELU and Softmax, a known algorithm [49] for the integer calculation of square roots is utilized to perform integer-only computation of LayerNorm. Finally, an integer-only framework is constructed by combining these approximations of GELU, Softmax, and LayerNorm. An illustration of I-BERT is presented on the right side of Fig. 5.4.
5.4.1 Integer-Only Computation of GELU and Softmax
For the integer-only computation of GELU and Softmax, I-BERT approximates each function with a class of interpolating polynomials. Given the function values at a set of n+1 distinct data points {(x0, f0), ..., (xn, fn)}, the goal is to find a polynomial of degree at most n that exactly matches the function values at these points. The authors note that a unique polynomial of degree at most n passes through all the data points, and they provide an analytic solution for the target polynomial.
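To make the construction concrete, the sketch below assembles such an interpolating polynomial from the Lagrange basis with NumPy. It is purely illustrative: the helper name and the sample points are ours, the arithmetic is floating point, and I-BERT instead evaluates fixed low-order polynomials with integer arithmetic on the quantized inputs.

```python
import numpy as np

def lagrange_interpolation(points):
    """Unique polynomial of degree <= n through n+1 data points (x_i, f_i),
    assembled from the Lagrange basis; returns a numpy.poly1d object."""
    result = np.poly1d([0.0])
    for i, (x_i, f_i) in enumerate(points):
        # Basis polynomial l_i(x) = prod_{j != i} (x - x_j) / (x_i - x_j)
        basis = np.poly1d([1.0])
        for j, (x_j, _) in enumerate(points):
            if j != i:
                basis = basis * np.poly1d([1.0, -x_j]) / (x_i - x_j)
        result = result + f_i * basis
    return result

# Illustration: a quadratic that matches exp(x) at three points of a small
# negative interval (sample points chosen arbitrarily for this sketch).
pts = [(-0.7, np.exp(-0.7)), (-0.35, np.exp(-0.35)), (0.0, 1.0)]
p = lagrange_interpolation(pts)
print(p(-0.5), np.exp(-0.5))  # values agree closely inside the interval
```

In this setting, low-degree polynomials are generally preferable, since every additional degree costs extra integer multiplications and increases the risk of overflow at low precision.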
5.4.2 Integer-Only Computation of LayerNorm
For integer-only LayerNorm, the challenge is that the input statistics (i.e., μ and σ) change rapidly during training, so these values must be computed dynamically at run time. Computing μ is straightforward; however, evaluating σ requires the square-root function. To approximate the square-root function with integer-only calculation, the authors adopt an efficient iterative algorithm proposed in [49]. Given any non-negative integer input n, the algorithm, based on Newton's method, iteratively converges to the exact value of ⌊√n⌋ and requires only integer arithmetic. The remaining non-linear operations in LayerNorm, such as division and squaring, are then computed straightforwardly with integer arithmetic.
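The sketch below gives a minimal integer-only Newton iteration for ⌊√n⌋ of the kind referenced above. The function name, initial guess, and stopping rule are our own choices for illustration and are not necessarily the exact variant of [49] used by I-BERT.

```python
def integer_sqrt(n: int) -> int:
    """Integer-only Newton iteration that returns floor(sqrt(n)).

    Illustrative sketch: the initial guess and stopping rule are one
    common choice, not necessarily the exact variant of [49].
    """
    if n < 0:
        raise ValueError("input must be a non-negative integer")
    if n == 0:
        return 0
    x = n  # any starting point >= floor(sqrt(n)) works
    while True:
        y = (x + n // x) // 2  # Newton update using only integer arithmetic
        if y >= x:             # iterates decrease monotonically until convergence
            return x
        x = y

# The standard deviation in LayerNorm can then be obtained by applying
# integer_sqrt to the (appropriately scaled) integer variance.
assert integer_sqrt(10) == 3
assert integer_sqrt(16) == 4
```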
The integer-only quantization results for RoBERTa-Base/Large are presented in Table 5.3. As one can see, I-BERT consistently achieves accuracy comparable to or slightly higher than the baseline. For RoBERTa-Base, I-BERT matches or exceeds the baseline on all tasks (by up to 1.4 points for RTE), except for MNLI-m, QQP, and STS-B, where the accuracy drop is at most 0.3 points. A similar behavior can be observed for RoBERTa-Large, where I-BERT matches or outperforms the baseline accuracy on all downstream tasks. On average, I-BERT outperforms the baseline by 0.3/0.5 points for RoBERTa-Base/Large, respectively.
TABLE 5.3
I-BERT quantization result for RoBERTa-Base and RoBERTa-Large on the
development set of the GLUE benchmark. Baseline is trained from the pre-trained
models, and I-BERT is quantized and fine-tuned from the baseline.
RoBERTa-Base
Method    Precision  MNLI-m  MNLI-mm  QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE   Avg.
Baseline  FP32       87.8    87.4     90.4  92.8  94.6   61.2  91.1   90.9  78.0  86.0
I-BERT    INT8       87.5    87.4     90.2  92.8  95.2   62.5  90.8   91.1  79.4  86.3
Diff                 -0.3    0.0      -0.2  0.0   +0.6   +1.3  -0.3   +0.2  +1.4  +0.3

RoBERTa-Large
Method    Precision  MNLI-m  MNLI-mm  QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE   Avg.
Baseline  FP32       90.0    89.9     92.8  94.1  96.3   68.0  92.2   91.8  86.3  89.0
I-BERT    INT8       90.4    90.3     93.0  94.5  96.4   69.0  92.2   93.0  87.0  89.5
Diff                 +0.4    +0.4     +0.2  +0.4  +0.1   +1.0  0.0    +1.2  +0.7  +0.5